prosperLoanData Exploration

by Mohamed Adel

Preliminary Wrangling

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

In [19]:
# import all packages and set plots to be embedded inline
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

In [20]:
# load in the dataset into a pandas dataframe, print statistics
PLD = pd.read_csv('prosperLoanData.csv', dtype = {'LoanStatus' : str})
In [5]:
PLD.head(5)
Out[5]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20

5 rows × 81 columns

In [6]:
# high-level overview of data shape and composition
print(PLD.shape)
print(PLD.dtypes)
print(PLD.head(10))
(113937, 81)
ListingKey                              object
ListingNumber                            int64
ListingCreationDate                     object
CreditGrade                             object
Term                                     int64
LoanStatus                              object
ClosedDate                              object
BorrowerAPR                            float64
BorrowerRate                           float64
LenderYield                            float64
EstimatedEffectiveYield                float64
EstimatedLoss                          float64
EstimatedReturn                        float64
ProsperRating (numeric)                float64
ProsperRating (Alpha)                   object
ProsperScore                           float64
ListingCategory (numeric)                int64
BorrowerState                           object
Occupation                              object
EmploymentStatus                        object
EmploymentStatusDuration               float64
IsBorrowerHomeowner                       bool
CurrentlyInGroup                          bool
GroupKey                                object
DateCreditPulled                        object
CreditScoreRangeLower                  float64
CreditScoreRangeUpper                  float64
FirstRecordedCreditLine                 object
CurrentCreditLines                     float64
OpenCreditLines                        float64
                                        ...   
TotalProsperLoans                      float64
TotalProsperPaymentsBilled             float64
OnTimeProsperPayments                  float64
ProsperPaymentsLessThanOneMonthLate    float64
ProsperPaymentsOneMonthPlusLate        float64
ProsperPrincipalBorrowed               float64
ProsperPrincipalOutstanding            float64
ScorexChangeAtTimeOfListing            float64
LoanCurrentDaysDelinquent                int64
LoanFirstDefaultedCycleNumber          float64
LoanMonthsSinceOrigination               int64
LoanNumber                               int64
LoanOriginalAmount                       int64
LoanOriginationDate                     object
LoanOriginationQuarter                  object
MemberKey                               object
MonthlyLoanPayment                     float64
LP_CustomerPayments                    float64
LP_CustomerPrincipalPayments           float64
LP_InterestandFees                     float64
LP_ServiceFees                         float64
LP_CollectionFees                      float64
LP_GrossPrincipalLoss                  float64
LP_NetPrincipalLoss                    float64
LP_NonPrincipalRecoverypayments        float64
PercentFunded                          float64
Recommendations                          int64
InvestmentFromFriendsCount               int64
InvestmentFromFriendsAmount            float64
Investors                                int64
Length: 81, dtype: object
                ListingKey  ListingNumber            ListingCreationDate  \
0  1021339766868145413AB3B         193129  2007-08-26 19:09:29.263000000   
1  10273602499503308B223C1        1209647  2014-02-27 08:28:07.900000000   
2  0EE9337825851032864889A          81716  2007-01-05 15:00:47.090000000   
3  0EF5356002482715299901A         658116  2012-10-22 11:02:35.010000000   
4  0F023589499656230C5E3E2         909464  2013-09-14 18:38:39.097000000   
5  0F05359734824199381F61D        1074836  2013-12-14 08:26:37.093000000   
6  0F0A3576754255009D63151         750899  2013-04-12 09:52:56.147000000   
7  0F1035772717087366F9EA7         768193  2013-05-05 06:49:27.493000000   
8  0F043596202561788EA13D5        1023355  2013-12-02 10:43:39.117000000   
9  0F043596202561788EA13D5        1023355  2013-12-02 10:43:39.117000000   

  CreditGrade  Term LoanStatus           ClosedDate  BorrowerAPR  \
0           C    36  Completed  2009-08-14 00:00:00      0.16516   
1         NaN    36    Current                  NaN      0.12016   
2          HR    36  Completed  2009-12-17 00:00:00      0.28269   
3         NaN    36    Current                  NaN      0.12528   
4         NaN    36    Current                  NaN      0.24614   
5         NaN    60    Current                  NaN      0.15425   
6         NaN    36    Current                  NaN      0.31032   
7         NaN    36    Current                  NaN      0.23939   
8         NaN    36    Current                  NaN      0.07620   
9         NaN    36    Current                  NaN      0.07620   

   BorrowerRate  LenderYield    ...     LP_ServiceFees  LP_CollectionFees  \
0        0.1580       0.1380    ...            -133.18                0.0   
1        0.0920       0.0820    ...               0.00                0.0   
2        0.2750       0.2400    ...             -24.20                0.0   
3        0.0974       0.0874    ...            -108.01                0.0   
4        0.2085       0.1985    ...             -60.27                0.0   
5        0.1314       0.1214    ...             -25.33                0.0   
6        0.2712       0.2612    ...             -22.95                0.0   
7        0.2019       0.1919    ...             -69.21                0.0   
8        0.0629       0.0529    ...             -16.77                0.0   
9        0.0629       0.0529    ...             -16.77                0.0   

   LP_GrossPrincipalLoss  LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments  \
0                    0.0                  0.0                             0.0   
1                    0.0                  0.0                             0.0   
2                    0.0                  0.0                             0.0   
3                    0.0                  0.0                             0.0   
4                    0.0                  0.0                             0.0   
5                    0.0                  0.0                             0.0   
6                    0.0                  0.0                             0.0   
7                    0.0                  0.0                             0.0   
8                    0.0                  0.0                             0.0   
9                    0.0                  0.0                             0.0   

   PercentFunded  Recommendations InvestmentFromFriendsCount  \
0            1.0                0                          0   
1            1.0                0                          0   
2            1.0                0                          0   
3            1.0                0                          0   
4            1.0                0                          0   
5            1.0                0                          0   
6            1.0                0                          0   
7            1.0                0                          0   
8            1.0                0                          0   
9            1.0                0                          0   

  InvestmentFromFriendsAmount Investors  
0                         0.0       258  
1                         0.0         1  
2                         0.0        41  
3                         0.0       158  
4                         0.0        20  
5                         0.0         1  
6                         0.0         1  
7                         0.0         1  
8                         0.0         1  
9                         0.0         1  

[10 rows x 81 columns]
In [7]:
PLD.nunique()
Out[7]:
ListingKey                             113066
ListingNumber                          113066
ListingCreationDate                    113064
CreditGrade                                 8
Term                                        3
LoanStatus                                 12
ClosedDate                               2802
BorrowerAPR                              6677
BorrowerRate                             2294
LenderYield                              2283
EstimatedEffectiveYield                  1582
EstimatedLoss                             120
EstimatedReturn                          1476
ProsperRating (numeric)                     7
ProsperRating (Alpha)                       7
ProsperScore                               11
ListingCategory (numeric)                  21
BorrowerState                              51
Occupation                                 67
EmploymentStatus                            8
EmploymentStatusDuration                  605
IsBorrowerHomeowner                         2
CurrentlyInGroup                            2
GroupKey                                  706
DateCreditPulled                       112992
CreditScoreRangeLower                      26
CreditScoreRangeUpper                      26
FirstRecordedCreditLine                 11585
CurrentCreditLines                         54
OpenCreditLines                            52
                                        ...  
TotalProsperLoans                           9
TotalProsperPaymentsBilled                130
OnTimeProsperPayments                     127
ProsperPaymentsLessThanOneMonthLate        41
ProsperPaymentsOneMonthPlusLate            17
ProsperPrincipalBorrowed                 1442
ProsperPrincipalOutstanding             13875
ScorexChangeAtTimeOfListing               372
LoanCurrentDaysDelinquent                2411
LoanFirstDefaultedCycleNumber              43
LoanMonthsSinceOrigination                 94
LoanNumber                             113066
LoanOriginalAmount                       2468
LoanOriginationDate                      1873
LoanOriginationQuarter                     33
MemberKey                               90831
MonthlyLoanPayment                      23567
LP_CustomerPayments                     78523
LP_CustomerPrincipalPayments            49828
LP_InterestandFees                      78751
LP_ServiceFees                          19813
LP_CollectionFees                        6175
LP_GrossPrincipalLoss                   15848
LP_NetPrincipalLoss                     15709
LP_NonPrincipalRecoverypayments          2631
PercentFunded                             719
Recommendations                            17
InvestmentFromFriendsCount                 14
InvestmentFromFriendsAmount               726
Investors                                 751
Length: 81, dtype: int64
In [21]:
ordinal_var_dict = {'LoanStatus': ['FinalPaymentInProgress','Current','Completed' ,'Defaulted','Chargedoff','Past Due (61-90 days)'
                                   ,'Past Due (31-60 days)','Past Due (>120 days)','Past Due (91-120 days)','Past Due (16-30 days)'],
                    'ProsperRating (Alpha)': ['A', 'AA', 'B', 'C', 'D', 'E', 'HR'],
                    'EmploymentStatus': ['Self-employed' , 'Employed' ,'Not available', 'Full-time', 'Other', 
                                         'Retired' , 'Part-time']}

for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    PLD[var] = PLD[var].astype(ordered_var)
In [22]:
PLD[PLD.isnull().any(axis=1)].count()
Out[22]:
ListingKey                             113937
ListingNumber                          113937
ListingCreationDate                    113937
CreditGrade                             28953
Term                                   113937
LoanStatus                             113126
ClosedDate                              55089
BorrowerAPR                            113912
BorrowerRate                           113937
LenderYield                            113937
EstimatedEffectiveYield                 84853
EstimatedLoss                           84853
EstimatedReturn                         84853
ProsperRating (numeric)                 84853
ProsperRating (Alpha)                   84853
ProsperScore                            84853
ListingCategory (numeric)              113937
BorrowerState                          108422
Occupation                             110349
EmploymentStatus                       110847
EmploymentStatusDuration               106312
IsBorrowerHomeowner                    113937
CurrentlyInGroup                       113937
GroupKey                                13341
DateCreditPulled                       113937
CreditScoreRangeLower                  113346
CreditScoreRangeUpper                  113346
FirstRecordedCreditLine                113240
CurrentCreditLines                     106333
OpenCreditLines                        106333
                                        ...  
TotalProsperLoans                       22085
TotalProsperPaymentsBilled              22085
OnTimeProsperPayments                   22085
ProsperPaymentsLessThanOneMonthLate     22085
ProsperPaymentsOneMonthPlusLate         22085
ProsperPrincipalBorrowed                22085
ProsperPrincipalOutstanding             22085
ScorexChangeAtTimeOfListing             18928
LoanCurrentDaysDelinquent              113937
LoanFirstDefaultedCycleNumber           16952
LoanMonthsSinceOrigination             113937
LoanNumber                             113937
LoanOriginalAmount                     113937
LoanOriginationDate                    113937
LoanOriginationQuarter                 113937
MemberKey                              113937
MonthlyLoanPayment                     113937
LP_CustomerPayments                    113937
LP_CustomerPrincipalPayments           113937
LP_InterestandFees                     113937
LP_ServiceFees                         113937
LP_CollectionFees                      113937
LP_GrossPrincipalLoss                  113937
LP_NetPrincipalLoss                    113937
LP_NonPrincipalRecoverypayments        113937
PercentFunded                          113937
Recommendations                        113937
InvestmentFromFriendsCount             113937
InvestmentFromFriendsAmount            113937
Investors                              113937
Length: 81, dtype: int64
In [23]:
print(PLD.describe())
       ListingNumber           Term    BorrowerAPR   BorrowerRate  \
count   1.139370e+05  113937.000000  113912.000000  113937.000000   
mean    6.278857e+05      40.830248       0.218828       0.192764   
std     3.280762e+05      10.436212       0.080364       0.074818   
min     4.000000e+00      12.000000       0.006530       0.000000   
25%     4.009190e+05      36.000000       0.156290       0.134000   
50%     6.005540e+05      36.000000       0.209760       0.184000   
75%     8.926340e+05      36.000000       0.283810       0.250000   
max     1.255725e+06      60.000000       0.512290       0.497500   

         LenderYield  EstimatedEffectiveYield  EstimatedLoss  EstimatedReturn  \
count  113937.000000             84853.000000   84853.000000     84853.000000   
mean        0.182701                 0.168661       0.080306         0.096068   
std         0.074516                 0.068467       0.046764         0.030403   
min        -0.010000                -0.182700       0.004900        -0.182700   
25%         0.124200                 0.115670       0.042400         0.074080   
50%         0.173000                 0.161500       0.072400         0.091700   
75%         0.240000                 0.224300       0.112000         0.116600   
max         0.492500                 0.319900       0.366000         0.283700   

       ProsperRating (numeric)  ProsperScore      ...        LP_ServiceFees  \
count             84853.000000  84853.000000      ...         113937.000000   
mean                  4.072243      5.950067      ...            -54.725641   
std                   1.673227      2.376501      ...             60.675425   
min                   1.000000      1.000000      ...           -664.870000   
25%                   3.000000      4.000000      ...            -73.180000   
50%                   4.000000      6.000000      ...            -34.440000   
75%                   5.000000      8.000000      ...            -13.920000   
max                   7.000000     11.000000      ...             32.060000   

       LP_CollectionFees  LP_GrossPrincipalLoss  LP_NetPrincipalLoss  \
count      113937.000000          113937.000000        113937.000000   
mean          -14.242698             700.446342           681.420499   
std           109.232758            2388.513831          2357.167068   
min         -9274.750000             -94.200000          -954.550000   
25%             0.000000               0.000000             0.000000   
50%             0.000000               0.000000             0.000000   
75%             0.000000               0.000000             0.000000   
max             0.000000           25000.000000         25000.000000   

       LP_NonPrincipalRecoverypayments  PercentFunded  Recommendations  \
count                    113937.000000  113937.000000    113937.000000   
mean                         25.142686       0.998584         0.048027   
std                         275.657937       0.017919         0.332353   
min                           0.000000       0.700000         0.000000   
25%                           0.000000       1.000000         0.000000   
50%                           0.000000       1.000000         0.000000   
75%                           0.000000       1.000000         0.000000   
max                       21117.900000       1.012500        39.000000   

       InvestmentFromFriendsCount  InvestmentFromFriendsAmount      Investors  
count               113937.000000                113937.000000  113937.000000  
mean                     0.023460                    16.550751      80.475228  
std                      0.232412                   294.545422     103.239020  
min                      0.000000                     0.000000       1.000000  
25%                      0.000000                     0.000000       2.000000  
50%                      0.000000                     0.000000      44.000000  
75%                      0.000000                     0.000000     115.000000  
max                     33.000000                 25000.000000    1189.000000  

[8 rows x 61 columns]
In [24]:
PLD.head(0)
Out[24]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors

0 rows × 81 columns

because Prosper has only been using its own proprietary Prosper Rating since 2009, we have a lot of missing ProsperRating column values. Let's get these missing values dropped:

In [25]:
PLD = PLD.dropna(subset=['ProsperRating (Alpha)']).reset_index()
In [26]:
PLD['LoanOriginationDate'] = pd.to_datetime(PLD['LoanOriginationDate'])
In [27]:
PLD.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84853 entries, 0 to 84852
Data columns (total 82 columns):
index                                  84853 non-null int64
ListingKey                             84853 non-null object
ListingNumber                          84853 non-null int64
ListingCreationDate                    84853 non-null object
CreditGrade                            0 non-null object
Term                                   84853 non-null int64
LoanStatus                             84047 non-null category
ClosedDate                             26005 non-null object
BorrowerAPR                            84853 non-null float64
BorrowerRate                           84853 non-null float64
LenderYield                            84853 non-null float64
EstimatedEffectiveYield                84853 non-null float64
EstimatedLoss                          84853 non-null float64
EstimatedReturn                        84853 non-null float64
ProsperRating (numeric)                84853 non-null float64
ProsperRating (Alpha)                  84853 non-null category
ProsperScore                           84853 non-null float64
ListingCategory (numeric)              84853 non-null int64
BorrowerState                          84853 non-null object
Occupation                             83520 non-null object
EmploymentStatus                       84204 non-null category
EmploymentStatusDuration               84834 non-null float64
IsBorrowerHomeowner                    84853 non-null bool
CurrentlyInGroup                       84853 non-null bool
GroupKey                               2026 non-null object
DateCreditPulled                       84853 non-null object
CreditScoreRangeLower                  84853 non-null float64
CreditScoreRangeUpper                  84853 non-null float64
FirstRecordedCreditLine                84853 non-null object
CurrentCreditLines                     84853 non-null float64
OpenCreditLines                        84853 non-null float64
TotalCreditLinespast7years             84853 non-null float64
OpenRevolvingAccounts                  84853 non-null int64
OpenRevolvingMonthlyPayment            84853 non-null float64
InquiriesLast6Months                   84853 non-null float64
TotalInquiries                         84853 non-null float64
CurrentDelinquencies                   84853 non-null float64
AmountDelinquent                       84853 non-null float64
DelinquenciesLast7Years                84853 non-null float64
PublicRecordsLast10Years               84853 non-null float64
PublicRecordsLast12Months              84853 non-null float64
RevolvingCreditBalance                 84853 non-null float64
BankcardUtilization                    84853 non-null float64
AvailableBankcardCredit                84853 non-null float64
TotalTrades                            84853 non-null float64
TradesNeverDelinquent (percentage)     84853 non-null float64
TradesOpenedLast6Months                84853 non-null float64
DebtToIncomeRatio                      77557 non-null float64
IncomeRange                            84853 non-null object
IncomeVerifiable                       84853 non-null bool
StatedMonthlyIncome                    84853 non-null float64
LoanKey                                84853 non-null object
TotalProsperLoans                      19797 non-null float64
TotalProsperPaymentsBilled             19797 non-null float64
OnTimeProsperPayments                  19797 non-null float64
ProsperPaymentsLessThanOneMonthLate    19797 non-null float64
ProsperPaymentsOneMonthPlusLate        19797 non-null float64
ProsperPrincipalBorrowed               19797 non-null float64
ProsperPrincipalOutstanding            19797 non-null float64
ScorexChangeAtTimeOfListing            16642 non-null float64
LoanCurrentDaysDelinquent              84853 non-null int64
LoanFirstDefaultedCycleNumber          6244 non-null float64
LoanMonthsSinceOrigination             84853 non-null int64
LoanNumber                             84853 non-null int64
LoanOriginalAmount                     84853 non-null int64
LoanOriginationDate                    84853 non-null datetime64[ns]
LoanOriginationQuarter                 84853 non-null object
MemberKey                              84853 non-null object
MonthlyLoanPayment                     84853 non-null float64
LP_CustomerPayments                    84853 non-null float64
LP_CustomerPrincipalPayments           84853 non-null float64
LP_InterestandFees                     84853 non-null float64
LP_ServiceFees                         84853 non-null float64
LP_CollectionFees                      84853 non-null float64
LP_GrossPrincipalLoss                  84853 non-null float64
LP_NetPrincipalLoss                    84853 non-null float64
LP_NonPrincipalRecoverypayments        84853 non-null float64
PercentFunded                          84853 non-null float64
Recommendations                        84853 non-null int64
InvestmentFromFriendsCount             84853 non-null int64
InvestmentFromFriendsAmount            84853 non-null float64
Investors                              84853 non-null int64
dtypes: bool(3), category(3), datetime64[ns](1), float64(50), int64(12), object(13)
memory usage: 49.7+ MB

because there is no previous credit history on Prosper the value for column TotalProsperLoans will be NaN, let's replace it with 0.

In [28]:
PLD['TotalProsperLoans'] = PLD['TotalProsperLoans'].fillna(0)
In [29]:
PLD.head(5)
Out[29]:
index ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
1 3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
2 4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20
3 5 0F05359734824199381F61D 1074836 2013-12-14 08:26:37.093000000 NaN 60 Current NaN 0.15425 0.1314 ... -25.33 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
4 6 0F0A3576754255009D63151 750899 2013-04-12 09:52:56.147000000 NaN 36 Current NaN 0.31032 0.2712 ... -22.95 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1

5 rows × 82 columns

What is the structure of your dataset?

What is the structure of your dataset?

There are 113937 loan data in the dataset with 82 features (ListingKey, ListingNumber, ListingCreationDate, CreditGrade, Term, LoanStatus, ClosedDate, BorrowerAPR, BorrowerRate,...etc). ( 61 colum ) Most variables are numeric in nature, but the variables EmploymentStatus, LoanStatus, ProsperRating (Alpha), and BorrowerState are ordered factor variables

What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out what features are best for predicting the proposer loans rate of success in the dataset.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that borrower APR will have the strongest effect on each rate of success proposer loans: the length of the loan expressed in months (term) ,The Borrower's interest rate for this loan (BorrowerRate) ,LoanOriginalAmount .

Univariate Exploration

I'll start by looking at the distribution of the main variable of interest: the proposer loans rate of success.

Univariate Exploration

In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

In [17]:
binsize = 0.01
bins = np.arange(0, PLD['BorrowerRate'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = PLD, x = 'BorrowerRate', bins = bins)
plt.xlabel('BorrowerRate')
plt.show()

The distribution of the borrower rate apears to be bimodal with first peak around 0.16, larger peak (true mode) around 0.32. Let's check number of occurrences:

In [4]:
binsize = 20
bins = np.arange(0, PLD['Term'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = PLD, x = 'Term', bins = bins)
plt.xticks([12 ,36 ,60])
plt.xlabel('Term')
plt.show()

For loan terms there are three options: 36, 60 and 12. Most common is 36 months.

In [15]:
binsize = 0.025
bins = np.arange(0, PLD['BorrowerAPR'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = PLD, x = 'BorrowerAPR', bins = bins)
plt.xlabel('BorrowerAPR')
plt.show()

the maximam APR rate are between 0.2 more or less

In [8]:
binsize = 1000
bins = np.arange(0, PLD['LoanOriginalAmount'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = PLD, x = 'LoanOriginalAmount', bins = bins)
plt.xlabel('LoanOriginalAmount')
plt.show()

most loans are less than 10000 bound

In [31]:
# let's plot all three together to get an idea of each ordinal variable's distribution.

fig, ax = plt.subplots(nrows=4, figsize = [30,40])

default_color = sb.color_palette()[0]


sb.countplot(data = PLD, x = 'LoanStatus', color = default_color, ax = ax[0] )
#plt.xticks(rotation=50)
sb.countplot(data = PLD, x = 'ProsperRating (Alpha)', color = default_color, ax = ax[1])
sb.countplot(data = PLD, x = 'EmploymentStatus', color = default_color, ax = ax[2])
sb.countplot(data = PLD, x = 'CurrentlyInGroup', color = default_color, ax = ax[3])

for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=40)
plt.show()

tips:

  1. most of loans are currently paid or n un compiletly paid
  2. Interestingly enough ratings are almost normally distributed with the highest AA rating and the lowest rating HR are the least common and average rating C being the most common
  3. most loan carriyers (borrower) are employed
  4. most loan carriyers (borrower) are not in group
In [36]:
sb.countplot(data=PLD, x='ListingCategory (numeric)', color=default_color)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x14752031b70>

The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

The overwhelming majority of loans are used for debt consolidation. Other notable categories include Other, Auto, Home Improvement and Business.

In [38]:
sb.countplot(data=PLD, x='Recommendations', color=default_color);

Overwhelming majority of loans obtained without recomendations.

In [51]:
fig = plt.subplots(figsize = [10,8])

default_color = sb.color_palette()[0]

sb.countplot(data = PLD, x = 'ProsperScore', color = default_color )

plt.show()

ProsperScore :A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. are normally disttributed.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Prosper ratings are almost normally distributed.ProsperScore are normally distributed .The distribution of borrowers APR looks multimodal. Most of the values are at the range of 0.05 and 0.4. There are no unusual points and no need to perform any transformations.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

the count plots are normally distributed at most no need for any operations

Bivariate Exploration

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

In [32]:
numeric_vars = ['Term', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'StatedMonthlyIncome','LoanOriginalAmount',
               'ProsperScore']
categoric_vars = ['LoanStatus','ProsperRating (Alpha)','EmploymentStatus' ,'ProsperRating (Alpha)','IsBorrowerHomeowner','CurrentlyInGroup']
In [33]:
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(PLD[numeric_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.show()

BorrowerAPR and ProsperScore are negative because borrowers with lower score are more likely to pay higher APR. Similarly, higher CreditScore means the borrowers are more trustworthy, therefore it recevied lower APR.

In [34]:
# plot matrix: sample 500 diamonds so that plots are clearer and
# they render faster
samples = np.random.choice(PLD.shape[0], 500, replace = False)
PLD_samp = PLD.loc[samples,:]

g = sb.PairGrid(data = PLD_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter)
Out[34]:
<seaborn.axisgrid.PairGrid at 0x2ac4b950cc0>

Matrix Plot: Similar to the correlation plot, we can determine which pair has negative or positive relationships from analyzing the pattern in each scatter plots. ProsperScore seems to be more related to BorrowerAPR compare to other variables. StatedMonthlyIncome does not give useful information on BorrowerAPR and will not be further analyzed.

In [54]:
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.

samples = np.random.choice(PLD.shape[0], 2000, replace = False)
PLD_samp = PLD.loc[samples,:]

def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x, y, color = default_color)

plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = PLD_samp, y_vars = numeric_vars , x_vars = categoric_vars,
                size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();
C:\Users\mohamed3del\Anaconda3\lib\site-packages\seaborn\axisgrid.py:1241: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(UserWarning(msg))
<Figure size 720x720 with 0 Axes>

The figure shows that the loan amount is increased with the increase of loan term. The borrower APR decreases with the better rating. Borrowers with the best Prosper ratings have the lowest APR. It means that the Prosper rating has a strong effect on borrower APR. Borrowers with better rating also have larger monthly income and loan amount. Employed, self-employed and full time borrowers have more monthly income and loan amount than part-time, retired and not employed borrowers.

In [55]:
# since there's only three subplots to create, using the full data should be fine.
plt.figure(figsize = [15, 15])
x =['LoanStatus','ProsperRating (Alpha)','EmploymentStatus' ,'ProsperRating (Alpha)','IsBorrowerHomeowner','CurrentlyInGroup']
# subplot 1: LoanStatus vs EmploymentStatus
plt.subplot(4, 1, 1)
sb.countplot(data = PLD, x = 'EmploymentStatus', hue = 'LoanStatus', palette = 'Blues')

# subplot 2: LoanStatus vs. ProsperRating (Alpha)
ax = plt.subplot(4, 1, 2)
sb.countplot(data = PLD, x = 'ProsperRating (Alpha)', hue = 'LoanStatus', palette = 'Blues')
ax.legend(ncol = 1) # re-arrange legend to reduce overlapping

# subplot 3: EmploymentStatus vs. IsBorrowerHomeowner, use different color palette
ax = plt.subplot(4, 1, 3)
sb.countplot(data = PLD, x = 'EmploymentStatus', hue = 'IsBorrowerHomeowner', palette = 'Greens')
ax.legend(loc = 1, ncol = 1) # re-arrange legend to remove overlapping

# subplot 1: LoanStatus vs EmploymentStatus
plt.subplot(4, 1, 4)
sb.countplot(data = PLD, x = 'CurrentlyInGroup', hue = 'LoanStatus', palette = 'Blues')

plt.show()

most loaonstatus as current are for employed ones , has a 'c' rate for porsperRating (ALpha) and that employed ones most of them owend their iwen homes and not currently in group

In [17]:
plt.figure(figsize = [8, 6])
plt.scatter(data = PLD, x = 'Term', y = 'EstimatedLoss' ,alpha= 1/10)
#plt.xlim([0, 3.5])
plt.xlabel('Term')
plt.xticks([12 ,36,60])
plt.yscale('log')
plt.ylabel('EstimatedLoss')
plt.show()
In [18]:
plt.figure(figsize = [8, 6])
plt.scatter(data = PLD, x = 'BorrowerRate', y = 'EstimatedLoss' ,alpha= 1/20)
#plt.xlim([0, 3.5])
plt.xlabel('BorrowerRate')
#plt.xticks([12 ,36,60])
#plt.yscale('log')
plt.ylabel('EstimatedLoss')
plt.show()
In [56]:
plt.figure(figsize = [8, 6])
sb.regplot(data = PLD, x = 'LoanOriginalAmount', y = 'BorrowerAPR', scatter_kws={'alpha':0.01});`

This plot shows that at different size of the loan amount, the APR has a large range, but the range of APR decrease with the increase of loan amount. Overall, the borrower APR is negatively correlated with loan amount.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The APR of the borrower is negatively correlated with the original amount of the loan, which means that the larger the amount of the loan, the lower the APR. It also indicates that the APR has a wide range at the different size of the loan amount, but with the rise in the loan amount, the range of APR decreases. The rating of Prosper also has a clear influence on the APR of the applicant, which decreases with the better rating.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The original amount of the loan is positively associated with the monthly income reported, because borrowers with more monthly income might lend more money, it makes sense. It also illustrates that better-rated borrowers also have greater monthly income and loan amounts. There is a relationship between a performance rating and a term. Proportionally, on B and C ratings, there are more 60-month loans. There are just 36 months of loans for borrowers with HR ratings.

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

In [36]:
# Term effect on relationship of APR and loan amount
F=sb.FacetGrid(data=PLD, aspect=1.2, height=5, col='Term', col_wrap=4)
F.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
F.add_legend();
C:\Users\mohamed3del\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Term doesn't seem to have effect on relationship of APR and loan amount

In [37]:
# Prosper rating effect on relationship of APR and loan amount
H=sb.FacetGrid(data=PLD, aspect=1.2, height=5, col='ProsperRating (Alpha)', col_wrap=4)
H.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
H.add_legend();

With prosper rating, the loan sum and borrower APR have a negative relationship and turns marginally to likely as prosperRating is lifted from HR to A or better. This could be because individuals with A or AA ratings want to borrow more cash, raising APR that may deter them from borrowing any further and optimising benefit. But lower-rated individuals prefer to borrow less money, and a decrease in APR might allow them to borrow more.

In [35]:
fig = plt.figure(figsize = [8,6])
ax = sb.pointplot(data = PLD, x = 'ProsperRating (Alpha)', y = 'BorrowerAPR', hue = 'Term',
           palette = 'Blues', linestyles = '', dodge = 0.4, ci='sd')
plt.title('Borrower APR across rating and term')
plt.ylabel('Mean Borrower APR')
ax.set_yticklabels([],minor = True);

Interestingly, the borrower APR decrease with the increase of borrow term for people with HR-C raings. But for people with B-AA ratings, the APR increase with the increase of borrow term.

In [64]:
fig, ax = plt.subplots(ncols=2, figsize=[12,6])
sb.pointplot(data = PLD, x = 'ProsperRating (Alpha)', y = 'StatedMonthlyIncome', hue = 'Term',
           palette = 'Greens', linestyles = '', dodge = 0.4, ax=ax[0])
sb.pointplot(data = PLD, x = 'ProsperRating (Alpha)', y = 'LoanOriginalAmount', hue = 'Term',
           palette = 'Purples', linestyles = '', dodge = 0.4, ax=ax[1]);

there is a interaction between term and rating. We will see that with better Prosper rating, the loan amount of all three terms increases, the amplitude of loan amount between terms also becomes larger and it is increasing. it doesn't seem like there is a interaction effect between term and rating, the consequences of term is similar among different ratings.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I make my exploration analysis of borrower APR against loan amount by looking at the impact of the Prosper rating. The multivariate exploration showed that the relationship between borrower APR and loan amount turns from negative to slightly positive when the Prosper ratings increased from HR to AA I ,then i visualize the rating and term effects on loan amount, it is showing that if you have good Prosper rating, the loan amount of all three terms will be increase

Were there any interesting or surprising interactions between features?

there were a surprising interaction is that the borrower APR and loan amount is negatively correlated when the Prosper ratings are from HR to B, but the correlation is turned to be positive when the ratings are A and AA. Another interesting thing is that the borrower APR decrease with the increase of borrow term for people with HR-C raings. But for people with B-AA ratings, the APR increase with the borrow term.

At the end of your report, make sure that you export the notebook as an html file from the File > Download as... > HTML menu. Make sure you keep track of where the exported file goes, so you can put it in the same folder as this notebook for project submission. Also, make sure you remove all of the quote-formatted guide notes like this one before you finish your report!

In [ ]: